Introduction

Starting during World War II and the Second Wave of the feminist movement, women have been entering the workforce in greater numbers. As of 2015, women made up almost half of the American workforce at 46.8%. Despite this almost equal participation between men and women, women still earn less than men on average.

The gender wage gap has been and still is a contentious public issue. While there is no debate that women earn less than men, the key aspect of the debate has been this question:

Is there a significant difference in income between men and women and does this difference vary depending on other factors?

This is an important question to answer as, if it turns out that we cannot account for the difference appropriately, it could provide quantitative evidence that gender discrimination plays a significant role in causing the wage gap.

On the other hand, if it turns out that we can account for the difference by other causes, then this would suggest that there is not quantitative evidence supporting gender discrimination as a cause of the wage gap. It should be noted that such a finding is not the same thing as suggesting that gender discrimination does not influence the wage gap at all, only that other factors may play a relatively more significant role.

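In our attempt to answer this question, we examined the following variables, along with their relationships to gender and income, as possible other causes:

  • Highest degree
  • Occupation
  • Industry
  • Disability
  • Childhood financial difficulty
  • Age incarcerated
  • Urban vs. rural
  • Total children
  • Spouse income
  • Marital status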

Data Cleaning

We will be using the NLSY97 (National Longitudinal Survey of Youth, 1997 cohort) data set. This data set contains survey responses from thousands of individuals who have been surveyed every one or two years starting in 1997.

In the process of studying our variables, we noticed several issues that needed to be rectified prior to conducting both graphical/tabular summaries and the regression analysis.

Alternative Responses

All of the variables had several kinds of alternative responses. These included: Refusal, Don’t Know, Valid Skip, Invalid Skip, and Non-Interview. While we attempted to recode these responses in order to include as much data as we could in our analysis, much of the time this was not possible as there was no clear way to interpret the data. As such, we ended up removing many data points. We removed all alternative responses for these variables:

  • Income
  • Highest degree
  • Occupation
  • Industry
  • Childhood financial difficulty
  • Marital status
  • Urban vs. rural
  • Spouse income

We removed all the alternative responses except valid skips for these variables:

  • Disability
    • Valid skips were recoded as “No”, i.e., the respondent has never had a disability
  • Total children
    • Valid skips were recoded as “0” because these respondents did not have biological children
  • Age incarcerated
    • Valid skips were recoded as never incarcerated
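
The recoding rules above can be sketched in Python. This is only an illustration under stated assumptions: the variable names (`disability`, `total_children`, `age_incarcerated`), the response labels, and the helper `clean_row` are hypothetical and are not taken from the NLSY97 files or from our own scripts.

```python
# Labels treated as "alternative responses" (exact strings are assumed).
ALTERNATIVE = {"Refusal", "Don't Know", "Valid Skip", "Invalid Skip", "Non-Interview"}

# Variables where a valid skip has a clear meaning, and the value it is recoded to.
VALID_SKIP_RECODE = {
    "disability": "No",
    "total_children": 0,
    "age_incarcerated": "Never incarcerated",
}

def clean_row(row):
    """Return a cleaned copy of one respondent's row, or None if it must be dropped."""
    cleaned = {}
    for variable, value in row.items():
        if value == "Valid Skip" and variable in VALID_SKIP_RECODE:
            cleaned[variable] = VALID_SKIP_RECODE[variable]
        elif value in ALTERNATIVE:
            return None  # no clear interpretation, so the respondent is dropped
        else:
            cleaned[variable] = value
    return cleaned

# Toy rows for illustration only.
rows = [
    {"income": 52000, "disability": "Valid Skip", "total_children": "Valid Skip"},
    {"income": "Refusal", "disability": "No", "total_children": 2},
    {"income": 31000, "disability": "Yes", "total_children": "Invalid Skip"},
]
cleaned_rows = [c for c in (clean_row(r) for r in rows) if c is not None]
print(cleaned_rows)
```

Only the first toy row survives: its valid skips are recoded, while the other two rows contain responses that cannot be interpreted.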

Small Samples

Furthermore, several variables contained levels with fewer than 10 respondents or with no women. We excluded these levels because such small samples cannot give a clear picture of the average income within those groups. The excluded levels, grouped by variable, were:

  • Occupation
    • Those working as life, physical, and social science technicians
    • Those working in military specific occupations
    • Those working in ACS special codes
    • Those working in engineering or related technicians
    • Those working in entertainment attendants and related workers
    • Those working in farming, fishing, and forestry
    • Those working in food preparation
  • Marital status
    • Widowed respondents
  • Total children
    • Those who have 6 children
    • Those who have 7 children
    • Those who have 8 children
  • Highest degree
    • Those who received a PhD
  • Industry
    • Those working in mining
    • Those working in utilities
    • Those working in agriculture, forestries, and fisheries
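
The exclusion rule described above (fewer than 10 respondents, or no women) can be sketched as follows. The row layout and the function names `levels_to_keep` and `drop_small_levels` are hypothetical, and the data is a toy example rather than our sample.

```python
from collections import Counter

MIN_GROUP_SIZE = 10  # threshold stated in the text

def levels_to_keep(rows, variable):
    """Levels of `variable` with at least MIN_GROUP_SIZE respondents and at least one woman."""
    totals = Counter(r[variable] for r in rows)
    women = Counter(r[variable] for r in rows if r["sex"] == "Female")
    return {lvl for lvl, n in totals.items() if n >= MIN_GROUP_SIZE and women[lvl] > 0}

def drop_small_levels(rows, variable):
    """Keep only respondents whose level of `variable` passes the rule."""
    keep = levels_to_keep(rows, variable)
    return [r for r in rows if r[variable] in keep]

# Toy data for illustration only.
rows = (
    [{"sex": "Male", "industry": "Construction"}] * 8
    + [{"sex": "Female", "industry": "Construction"}] * 4
    + [{"sex": "Male", "industry": "Mining"}] * 12      # no women, so dropped
    + [{"sex": "Female", "industry": "Utilities"}] * 3  # fewer than 10, so dropped
)
kept = drop_small_levels(rows, "industry")
print(Counter(r["industry"] for r in kept))
```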

Topcoding

In our dataset, the income variable is topcoded for the top 2% of earners. This means that instead of each earner in the top 2% having their own data point (as is the case for the remaining 98%), they all share the average of their group, which is $235,884. This can distort both the graphical/tabular summaries and the regression analysis, because many respondents are placed on a single extreme value.

As such, throughout both of these sections, we reviewed the results with and without the topcoded values and then decided which results to present. For our graphical summaries, removing the topcoded values gave us a clearer understanding of our results. In our regression analysis, however, we did not notice a significant difference regardless of whether we included the topcoded values.

With topcoded values included:

                      Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)           57202.82   779.271     73.406   0
as.factor(sex)2       -15923.90  1118.772    -14.233  0

With topcoded values removed:

                      Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)           56108.74   1003.466    55.915   0
as.factor(sex)Female  -14354.72  1387.154    -10.348  0

The spouse income variable also had topcoding at the 2% level. See the spouse income section under graphical and tabular summaries to see how this was handled.
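
To make the topcoding mechanism concrete, here is a minimal Python sketch of how a top share of values can be replaced by their group mean, and how the topcoded rows can then be dropped. The functions `topcode` and `drop_topcoded` and the toy incomes are hypothetical; the survey's actual procedure may differ in details such as tie handling.

```python
def topcode(values, top_share=0.02):
    """Replace the highest `top_share` of values with the mean of that group."""
    n_top = max(1, round(len(values) * top_share))
    # Indices of the n_top largest values.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    top_idx = order[:n_top]
    top_mean = sum(values[i] for i in top_idx) / n_top
    coded = list(values)
    for i in top_idx:
        coded[i] = top_mean
    return coded, top_mean

def drop_topcoded(coded, top_mean):
    """Remove topcoded entries (assumes no genuine value equals the topcode mean)."""
    return [v for v in coded if v != top_mean]

# Toy data: 98 ordinary incomes plus two very high earners.
incomes = [1000 * i for i in range(1, 99)] + [300000, 500000]
coded, top_mean = topcode(incomes)
print(top_mean, max(drop_topcoded(coded, top_mean)))
```

Topcoding preserves the overall mean but collapses the upper tail onto one value, which is why plots and summaries can look different once the topcoded rows are removed.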

Graphical and Tabular Summaries

In order to better understand how each variable relates to income and gender, we have created several graphical and tabular summaries for each of them.

Highest degree

We see that men are more likely than women to have finished their education at lower levels, such as None and GED. In turn, women are more likely than men to have Associate/Junior college degrees, Bachelor’s Degrees, Master’s Degrees and Professional Degrees (DDS, JD, MD). However, there is some uncertainty as these differences may not be great enough to be statistically significant.

Highest Degree                     Gender  Count  Average Income
None                               Male    47     37961.70
None                               Female  24     22312.50
High School Diploma                Male    314    48866.85
High School Diploma                Female  326    31655.10
GED                                Male    77     41310.73
GED                                Female  52     26876.69
Associate/Junior College           Male    60     58579.73
Associate/Junior College           Female  68     43258.82
Bachelor’s Degree                  Male    211    68167.04
Bachelor’s Degree                  Female  293    51839.94
Master’s Degree                    Male    53     79814.60
Master’s Degree                    Female  72     59119.94
Professional Degree (DDS, JD, MD)  Male    5      119577.00
Professional Degree (DDS, JD, MD)  Female  7      73842.86

Men have a higher average salary than women across every educational group.

Occupation

Occupation Gender Count Average Income
OFFICE AND ADMINISTRATIVE SUPPORT WORKERS Male 62 50785.05
OFFICE AND ADMINISTRATIVE SUPPORT WORKERS Female 162 35095.52
EXECUTIVE, ADMINISTRATIVE AND MANAGERIAL Male 105 69064.24
EXECUTIVE, ADMINISTRATIVE AND MANAGERIAL Female 105 56774.56
MANAGEMENT RELATED Male 52 74116.77
MANAGEMENT RELATED Female 59 57536.59
MATHEMATICAL AND COMPUTER SCIENTISTS Male 45 69774.00
MATHEMATICAL AND COMPUTER SCIENTISTS Female 11 70272.73
ENGINEERS, ARCHITECTS, AND SURVEYORS Male 17 83147.06
ENGINEERS, ARCHITECTS, AND SURVEYORS Female 2 101500.00
PHYSICAL SCIENTISTS Male 6 50750.00
PHYSICAL SCIENTISTS Female 5 44000.00
SOCIAL SCIENTISTS AND RELATED WORKERS Male 3 61333.33
SOCIAL SCIENTISTS AND RELATED WORKERS Female 7 59397.00
COUNSELORS, SOCIAL, AND RELIGIOUS WORKERS Male 14 49907.14
COUNSELORS, SOCIAL, AND RELIGIOUS WORKERS Female 29 48961.14
LAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS Male 5 74600.00
LAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS Female 8 52250.00
TEACHERS Male 27 50851.85
TEACHERS Female 91 39624.18
EDUCATION, TRAINING, AND LIBRARY WORKERS Male 4 50000.00
EDUCATION, TRAINING, AND LIBRARY WORKERS Female 11 20272.73
ENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS Male 11 50909.09
ENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS Female 13 58315.38
MEDIA AND COMMUNICATION WORKERS Male 6 67166.67
MEDIA AND COMMUNICATION WORKERS Female 14 39363.43
HEALTH DIAGNOSIS AND TREATING PRACTITIONERS Male 9 75542.78
HEALTH DIAGNOSIS AND TREATING PRACTITIONERS Female 58 60834.48
HEALTH CARE TECHNICAL AND SUPPORT Male 15 44600.00
HEALTH CARE TECHNICAL AND SUPPORT Female 53 28562.47
PROTECTIVE SERVICE Male 35 65165.91
PROTECTIVE SERVICE Female 11 31590.91
FOOD PREPARATIONS AND SERVING RELATED Male 25 28512.00
FOOD PREPARATIONS AND SERVING RELATED Female 33 25348.48
CLEANING AND BUILDING SERVICE Male 17 44529.41
CLEANING AND BUILDING SERVICE Female 12 14166.67
PERSONAL CARE AND SERVICE WORKERS Male 8 35625.00
PERSONAL CARE AND SERVICE WORKERS Female 46 22007.98
SALES AND RELATED WORKERS Male 65 55674.63
SALES AND RELATED WORKERS Female 77 38892.27
CONSTRUCTION TRADES AND EXTRACTION WORKERS Male 68 50305.82
CONSTRUCTION TRADES AND EXTRACTION WORKERS Female 3 26166.67
INSTALLATION, MAINTENANCE, AND REPAIR WORKERS Male 51 49820.88
INSTALLATION, MAINTENANCE, AND REPAIR WORKERS Female 3 55333.33
PRODUCTION AND OPERATING WORKERS Male 15 45266.67
PRODUCTION AND OPERATING WORKERS Female 5 29400.00
SETTER, OPERATORS, AND TENDERS Male 30 47100.00
SETTER, OPERATORS, AND TENDERS Female 13 36153.85
TRANSPORTATION AND MATERIAL MOVING WORKERS Male 72 42699.72
TRANSPORTATION AND MATERIAL MOVING WORKERS Female 11 23909.09

Men on average have a higher income than women in all occupations except:

  • Engineers, architects and surveyors (though this small sub-sample only has 19 respondents, 2 of them women)
  • Entertainers and performers, sports and related workers (though this small sub-sample only has 24 respondents)
  • Installation, maintenance, and repair workers (though there are only 3 women in this occupation)
  • Mathematical and computer scientists (though the difference is slight)

The caveats in parentheses indicate that each of these comparisons carries substantial statistical uncertainty.

Furthermore, we see that men and women, on average, tend to choose different careers. These differences appear to be statistically significant for certain careers. Men appear to be more prevalent in:

  • Construction trades and extraction workers
  • Installation, maintenance, and repair workers
  • Mathematical and computer scientists
  • Production and operating workers
  • Protective service
  • Setter, operators, and tenders
  • Transportation and material moving workers

While women are more prevalent in:

  • Counselors, social and religious workers
  • Healthcare technical and support
  • Health diagnosis and treating practitioners
  • Office and Administrative support workers
  • Personal care and service workers
  • Teachers

In all other occupations, men and women appear to be present in around the same numbers.

Because men and women tend to choose different career paths, women appear more likely to be in occupations with lower salaries, while men appear more likely to be in occupations with higher salaries. For example, pooling the male and female rows of the table above, Office and Administrative Support Workers and Personal Care and Service Workers, both of which are predominantly female, have average salaries of $39,438.16 and $24,025.32, respectively. On the other hand, men are more prevalent among Construction Trades and Extraction Workers and Mathematical and Computer Scientists, which have average salaries of $49,285.86 and $69,871.96, respectively.
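
An occupation's overall average can be recovered from the sex-specific rows of the table by weighting each average by its count. The following Python sketch shows that arithmetic for four occupations, with counts and averages copied from the table above; the function name `pooled_mean` is ours, for illustration only.

```python
def pooled_mean(groups):
    """Count-weighted mean over (count, average) pairs."""
    total = sum(n for n, _ in groups)
    return sum(n * avg for n, avg in groups) / total

# (count, average income) for men and women, from the occupation table.
occupation_rows = {
    "Office and administrative support workers": [(62, 50785.05), (162, 35095.52)],
    "Personal care and service workers": [(8, 35625.00), (46, 22007.98)],
    "Construction trades and extraction workers": [(68, 50305.82), (3, 26166.67)],
    "Mathematical and computer scientists": [(45, 69774.00), (11, 70272.73)],
}

pooled = {name: pooled_mean(groups) for name, groups in occupation_rows.items()}
for name, value in pooled.items():
    print(f"{name}: {value:,.2f}")
```

Because the pooled mean is a weighted average, it always lies between the male and female averages for that occupation, which is a quick sanity check on any quoted figure.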

Industry

Industry Gender Count Average Income
EDUCATIONAL, HEALTH, AND SOCIAL SERVICES Male 94 54855.04
EDUCATIONAL, HEALTH, AND SOCIAL SERVICES Female 348 40014.54
CONSTRUCTION Male 92 52676.85
CONSTRUCTION Female 12 39125.00
MANUFACTURING Male 92 52863.33
MANUFACTURING Female 43 48290.70
WHOLESALE TRADE Male 33 59909.09
WHOLESALE TRADE Female 11 49636.36
RETAIL TRADE Male 54 46609.80
RETAIL TRADE Female 66 31904.55
TRANSPORTATION AND WAREHOUSING Male 45 50041.78
TRANSPORTATION AND WAREHOUSING Female 13 47153.85
INFORMATION AND COMMUNICATION Male 11 57590.91
INFORMATION AND COMMUNICATION Female 12 53416.67
FINANCE, INSURANCE, AND REAL ESTATE Male 60 66600.27
FINANCE, INSURANCE, AND REAL ESTATE Female 71 50113.69
PROFESSIONAL AND RELATED SERVICES Male 106 58773.08
PROFESSIONAL AND RELATED SERVICES Female 84 46411.90
ACS SPECIAL CODES Male 42 74471.02
ACS SPECIAL CODES Female 42 59523.81
ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES Male 49 37975.51
ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES Female 69 27030.41
OTHER SERVICES Male 28 48131.18
OTHER SERVICES Female 28 37039.29
PUBLIC ADMINISTRATION Male 61 69308.31
PUBLIC ADMINISTRATION Female 43 44680.40

According to the above table, men on average have a higher income than women in all industries. To see which of these differences in average incomes are statistically significant, see the corresponding graph in the Findings from Regression Analysis section.

Men and women also tend to choose different industries in some instances. The industries in which men are statistically significantly more numerous are:

  • Construction
  • Manufacturing
  • Professional and Related Services
  • Public Administration
  • Transportation and Warehousing
  • Wholesale Trade

Women are more prevalent in:

  • Educational, Health, and Social Services

In all other industries, men and women appear to be present in around the same numbers.

Similar to occupations, women seem to be more likely than men to be in less lucrative industries. Pooling the male and female rows of the table above, the female-dominant industry of educational, health, and social services has an average income of around $43,170.66, while the male-dominant industries of construction and manufacturing have average incomes of $51,113.17 and $51,406.86, respectively.

Disability

Disability?  Gender  Count  Average Income
No           Male    722    56526.67
No           Female  803    42215.67
Yes          Male    45     49403.33
Yes          Female  39     32248.77

Men appear to be more likely than women to have (or have ever had) some sort of disability: 45 of 767 men versus 39 of 842 women. However, this difference does not appear to be statistically significant, as it amounts to only 6 respondents.
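
A difference in counts alone does not settle significance, because the two groups differ in size. As a rough check, a two-proportion z-test on the counts in the table above can be sketched in Python. This test was not part of the analysis reported here; the function name is hypothetical and the normal approximation is an assumption.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # erfc(|z| / sqrt(2)) equals the two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts from the disability table: "Yes" respondents and group totals by sex.
men_yes, men_total = 45, 722 + 45
women_yes, women_total = 39, 803 + 39

z, p_value = two_proportion_z_test(men_yes, men_total, women_yes, women_total)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

Under these assumptions the p-value is well above 0.05, which agrees with the reading that the difference is not statistically significant.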

Disability?  Count  Average Income
No           1525   48991.11
Yes          84     41438.71

While those without disabilities do appear to on average have higher incomes than those with, this difference does not hold when including sex. For example, men who have (or have ever had) disabilities have a higher average income than women who have never had a disability.

Childhood financial difficulty

Childhood financial difficulty?  Gender  Count  Average Income
No                               Male    726    56904.14
No                               Female  813    41999.24
Yes                              Male    41     42024.39
Yes                              Female  29     34879.31

Men appear to be more likely than women to have had childhood financial difficulties. However, this difference does not appear to be statistically significant, as it amounts to only 12 respondents.

Childhood financial difficulty?  Count  Average Income
No                               1539   49030.40
Yes                              70     39064.29

Those without childhood financial difficulties appear to have higher incomes on average than those with. This difference holds within each sex: both men and women who experienced childhood financial difficulties earn less on average than their same-sex counterparts who did not. Across the sexes, however, men who had childhood financial difficulties have a slightly higher average income than women who never had them ($42,024.39 versus $41,999.24). This difference of about $25 is likely statistically insignificant.

Incarceration Age

Incarceration Age  Sex     Count  Average Income
16 to 18 years     Male    15     40991.33
16 to 18 years     Female  1      60000.00
19 to 21 years     Male    22     37772.73
19 to 21 years     Female  6      34666.67
22 to 25 years     Male    25     43672.00
22 to 25 years     Female  6      25500.00
26 to 30 years     Male    11     29118.18
26 to 30 years     Female  4      23625.00
31 to 99 years     Male    5      55000.00
Invalid Skip       Male    1      12000.00
Valid Skip         Male    688    57980.28
Valid Skip         Female  825    41989.56

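Men outnumber women among those incarcerated at every age. Furthermore, within each sex, those who were never incarcerated (the Valid Skip rows) generally earn more than those who were; the only exception is the single woman first incarcerated between the ages of 16 and 18. Across the sexes, men who were incarcerated between the ages of 22 and 25 or between the ages of 31 and 99 earn more on average than women who were never incarcerated.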

Urban vs. Rural

Urban vs. Rural  Average Income  Count
Rural            46556.64        412
Urban            49299.04        1197

The table above shows that urban dwellers have the higher average income. The violin plot confirms this, but it also suggests possible outliers, given the low density of individuals near the maximum income.

The table also shows that the majority of our non-missing respondents are urban dwellers. This means that our estimate for urban respondents should be closer to the true value than our estimate for rural respondents. However, since the rural sample is still large (412 respondents), we can still generalize from it.

The violin plot shows that those in rural settings are more concentrated at lower incomes. Both female and male urban dwellers have a higher maximum income than their rural counterparts.


Total Children

Total Children  Gender  Count  Average Income
0               Male    297    52624.14
0               Female  230    47496.58
1               Male    208    56235.50
1               Female  185    44808.45
2               Male    163    61102.33
2               Female  259    39855.14
3               Male    78     59915.00
3               Female  128    35954.13
4               Male    17     49705.88
4               Female  28     28982.14
5               Male    4      57750.00
5               Female  12     17250.00

We used the first table to drop every number-of-children group with fewer than 10 respondents. The above table shows that each remaining group has at least 10 respondents when men and women are combined, although only 4 men have five children. Though we have not yet tested for statistical significance, the table indicates that the income gap between men and women generally widens as the number of biological children increases. The regression will allow us to see whether these differences are statistically significant.

The bar chart shows that there are more women than men in every category except the 0 and 1 categories. This may be due to women having a higher likelihood of being single parents. Together with the table, the chart indicates that the sample size remains large enough across categories.

Spouse Income

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   28000   42000   50926   64250  283919

The above table summarizes the distribution of spouse income before removing the topcoded values. It shows that the data is skewed to the right, with a mean of $50,926 well above the median of $42,000.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   28000   42000   47961   62000  173000

Similar to the income variable, spouse’s income is topcoded for the top 2% of values. Before removing the topcoded values, the maximum is $283,919; after removing them, the maximum is $173,000. Though the average spouse income is now $47,961, the data is still skewed to the right, with a median of $42,000. The gap between the mean and the median is smaller than before, so this data is less skewed than the data that included the topcoded values. This suggests that our analysis will be less influenced by the highest incomes and will better reflect where the majority of respondents lie, improving the generalizability of our results.

Marital Status  Sex     Count  Average Income
Never married   Male    139    45939.11
Never married   Female  133    39477.95
Married         Male    573    59536.42
Married         Female  636    42752.24
Separated       Male    9      41888.89
Separated       Female  12     56125.00
Divorced        Male    46     46923.85
Divorced        Female  59     34193.14
Widowed         Female  2      12500.00

After removing the topcoded values, we see that an increase in spousal income is associated with an increase in individual income. The correlation coefficient of 0.19 between income and spousal income indicates a weak but positive correlation overall. This agrees with the graphs of spousal income versus income for both men and women. The correlation appears stronger for men than for women, but we would have to test whether that difference is statistically significant.
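
For reference, the correlation coefficient mentioned above is the Pearson correlation, which can be computed as in the following Python sketch. The function name `pearson_r` and the toy numbers are hypothetical and do not reproduce the 0.19 reported for our sample.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data for illustration only.
spouse_income = [20000, 35000, 42000, 60000, 90000]
income = [30000, 52000, 28000, 61000, 45000]

r = pearson_r(spouse_income, income)
print(f"r = {r:.2f}")
```

Computing `pearson_r` separately for men and for women would give the two coefficients whose difference would then need a formal test.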

Marital Status

Marital Status  Gender  Count  Average Income
Never married   Male    199    46756.46
Never married   Female  181    39019.91
Married         Male    746    58716.65
Married         Female  892    41847.40
Separated       Male    12     41833.33
Separated       Female  21     56166.67
Divorced        Male    61     48093.39
Divorced        Female  78     35716.60
Widowed         Female  3      25000.00
NA              Male    6      53500.00
NA              Female  1      21000.00

The first table shows that, after removing invalid and topcoded values, we had only 3 widowed women and no widowed men in our analysis. Excluding the widowed level could lead to other variables absorbing the effect that it would have had on income when we run a regression, which could inflate standard errors or bias the estimated effect sizes. The above graph shows that it is hard to determine whether the differences in income by marital status are statistically significant.

We see in the table of respondent counts that the majority of respondents are married. The average incomes by marital status and sex show an income gap in favor of men in every marital status except separated, where women earn more on average.

Methodology for Linear Regression Model

Missing Values

As mentioned in the Data Cleaning section, in the majority of cases we unfortunately had to simply remove the missing values, because the dataset did not provide clear notes on how to interpret them. For example, for the childhood financial difficulty variable, 1084 of the responses were marked as “Valid Skip”, but there was no note as to how this response differed from “No.” To avoid conflating “No” with the unclear “Valid Skip” response, we dropped those 1084 rows. Dropping these rows weakens the interpretation and generalizability of our analysis, not only because it reduces our sample size but, more importantly, because it could make our overall sample less random.

This would be the case if the respondents with the “Valid Skip” value were not a random sub-sample, and we have no evidence that they were. While the initial sampling was likely done randomly from the overall population, it is unlikely that those with the “Valid Skip” response also form a random sub-sample of the dataset. In turn, if our overall sample is no longer random (or is less random), we are less able to generalize our results and make predictions about the general population.

Topcoded Variables

For the topcoded variables of income and spouse income, we ended up removing the topcoded rows entirely. We did this for two reasons. Firstly, it made our graphs clearer by reducing their skewness. For example, in the violin plots for the urban vs. rural variable, the topcoded rows stretched the plots towards higher incomes, making it harder to interpret the larger distributions near the lower end of the income axis. Secondly, removing the topcoded rows did not significantly change our regression results in terms of which variables were and were not significant.

Plots We Tried

We produced a scatter plot of spouse income against the individual’s income. We expected to find that as a man’s income increased, his spouse’s income would decrease and, likewise, that a woman’s income would decrease as her spouse’s income increased. This did not hold true in our graph.

We also produced a scatter plot of age at first incarceration and income. We expected that those who had an earlier age of first incarceration would have a lower income than those who were incarcerated at a later age. Instead, we found that the overall trend of the scatter plot was a horizontal line, suggesting that regardless of when one is first incarcerated, they can expect to have a similar, below average income.

Variable Selection

To begin our variable selection process, we constructed linear regression models with and without each variable and compared them. From this, we found that the urban-rural variable is the only one that is not a significant predictor when added to the regression of income on sex.

To narrow down what variables to use in our analysis, we examined whether there was collinearity between variables. If we have variables with collinearity, then we have difficulty interpreting how those variables uniquely determine the results as they are associated with another variable. Collinearity therefore impacts the accuracy of our interpretations.

To determine collinearity, we run the following pairs plot.

The above plot shows that there is no pair of variables for which a large proportion of respondents falls into only one of the combined categories. For example, we do not see all women fall into rural and all men fall into urban. This observation holds for all the variables presented in the plot. Since there do not seem to be any cases in which knowing the value of one variable means we know the value of the other, we can confidently use a combination of these variables in our regression analyses.
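
The idea behind this check can also be expressed numerically with a cross-tabulation: if every level of one variable maps to a single level of another, the first variable determines the second. The Python sketch below is an illustration on toy data, not the pairs plot itself; `crosstab` and `determines` are hypothetical helper names.

```python
from collections import Counter, defaultdict

def crosstab(rows, a, b):
    """Counts of each level of `b` within each level of `a`."""
    table = defaultdict(Counter)
    for r in rows:
        table[r[a]][r[b]] += 1
    return table

def determines(rows, a, b):
    """True if knowing the level of `a` always tells us the level of `b`."""
    return all(len(counts) == 1 for counts in crosstab(rows, a, b).values())

# Toy data: sex and urban/rural are mixed, so neither determines the other.
mixed = [
    {"sex": "Female", "urban_rural": "Urban"},
    {"sex": "Female", "urban_rural": "Rural"},
    {"sex": "Male", "urban_rural": "Urban"},
    {"sex": "Male", "urban_rural": "Rural"},
]
# Toy data illustrating the problematic case described in the text.
separated = [
    {"sex": "Female", "urban_rural": "Rural"},
    {"sex": "Female", "urban_rural": "Rural"},
    {"sex": "Male", "urban_rural": "Urban"},
]
print(determines(mixed, "sex", "urban_rural"), determines(separated, "sex", "urban_rural"))
```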

After reviewing the graphical and tabular summaries along with the above pairs plot, we felt that the following five variables showed the greatest variation in income between genders:

  • Highest degree
  • Spouse income
  • Total children
  • Occupation
  • Industry

While the other variables provided informative descriptions, we decided to not include them either because they were most applicable only to a small sub-sample of the dataset (such as childhood financial difficulty, disability, and age incarcerated), or because we simply did not want to over-complicate the following regressions with too many variables.

Findings from Regression Analysis

We utilized the above methodology to determine what variables we want to test and then include in our analysis.

For our variables, we used the following baselines:

  • Sex: Male, to highlight how much women are disadvantaged in terms of income
  • Highest degree: None, to see the impact of educational capital
  • Total children: 0, to see the association between the number of children and income
  • Occupation: Office and Administrative Support Workers, as it is the most populous category and so will improve generalizability
  • Industry: Educational, Health, and Social Services, as it is the most populous category and so will improve generalizability

Regressing Income on Sex

We first begin with an assessment of the relationship between sex and income. We start with the following model:

\(\text{Income} = \text{Intercept} + \beta \cdot \text{sex}\)


             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  56108.74   1003.466    55.915   0
sexFemale    -14354.72  1387.154    -10.348  0

Male represents the baseline for sex. From the above model, we see that women earn an average of $14,354.72 less than men. With a p-value that rounds to 0, this difference is significant at the 5% significance level. The next step is to determine whether this statistically significant difference holds when including the effects of other variables.
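
With a single 0/1 indicator as the only predictor, the least-squares intercept equals the average income of the baseline group and the coefficient equals the difference in group averages. The Python sketch below illustrates this on toy data; `simple_ols` is a hypothetical helper, and the last lines only restate the arithmetic implied by the coefficients in the table above.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Toy incomes for illustration only.
men = [50000, 60000, 58000]
women = [40000, 44000, 42000]

female = [0] * len(men) + [1] * len(women)  # indicator: 0 = Male (baseline), 1 = Female
income = men + women

intercept, slope = simple_ols(female, income)
print(intercept, slope)  # baseline (male) mean and female-minus-male difference

# Average female income implied by the reported coefficients.
implied_female_mean = 56108.74 + (-14354.72)
print(round(implied_female_mean, 2))
```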

Adding highest degree and urban vs. rural to determine significance

In this section, we look at the impact of adding the highest degree and urban vs. rural variables to our regression to determine whether it changes the effect size that sex has on income. First we look at the regression model:

Estimate Std. Error t value Pr(>|t|)
(Intercept) 36420.216 3212.338 11.338 0.0000
sexFemale -16952.485 1258.239 -13.473 0.0000
highest.degree.attained.2017High School Diploma 10470.646 3135.558 3.339 0.0009
highest.degree.attained.2017GED 4087.557 3696.377 1.106 0.2690
highest.degree.attained.2017Associate/Junior College 21386.823 3712.514 5.761 0.0000
highest.degree.attained.2017Bachelor’s Degree 30283.244 3184.942 9.508 0.0000
highest.degree.attained.2017Master’s Degree 39244.067 3727.263 10.529 0.0000
highest.degree.attained.2017Professional Degree (DDS, JD, MD) 64721.585 7811.357 8.286 0.0000
urban.ruralUrban 2468.847 1432.018 1.724 0.0849

Through this table we see that the sexFemale coefficient is statistically significant at the 5% significance level. We also see that the coefficients for every degree level except GED (High School Diploma, Associate/Junior College, Bachelor’s, Master’s, and Professional Degrees) are statistically significant at the 5% significance level in comparison to the baseline of no degree attained. This indicates that highest degree attained, in addition to gender, is associated with income. The urban coefficient suggests that those in urban areas earn $2,468.85 more on average than those in rural settings, all other factors being constant. However, this difference is not statistically significant at the 5% level.

To check whether the highest degree attained variable contributes significantly to the above model, we remove the urban-rural variable and compare the resulting model to the regression of income on sex.

Running an ANOVA on the models with and without education gives a p-value that rounds to 0, which is statistically significant, so highest degree attained is an important variable for modeling and explaining income.

Next, we run an ANOVA to determine whether the urban-rural variable is significant.

## Analysis of Variance Table
## 
## Model 1: income ~ sex + urban.rural
## Model 2: income ~ sex
##   Res.Df        RSS Df   Sum of Sq      F Pr(>F)
## 1   1606 1.2392e+12                             
## 2   1607 1.2411e+12 -1 -1900207595 2.4626 0.1168

The urban coefficient was not significant in the model. The ANOVA confirms this, as the p-value is 0.1168, which is not statistically significant at the 5% level. The urban-rural variable therefore is not a significant variable in modeling income, so we removed it from our next regression.
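
The F statistic in the ANOVA table can be reproduced by hand from the residual sums of squares of the two nested models. The Python sketch below uses the values printed in the table above; the function name is hypothetical, and because the table rounds the RSS values, the result only matches the printed F to about three decimal places. The p-value additionally requires the F distribution, which is not included here.

```python
def nested_f_statistic(rss_full, df_resid_full, rss_reduced, df_resid_reduced):
    """F statistic comparing a reduced model against the fuller model that nests it."""
    df_diff = df_resid_reduced - df_resid_full
    numerator = (rss_reduced - rss_full) / df_diff
    denominator = rss_full / df_resid_full
    return numerator / denominator

# Values from the ANOVA table for income ~ sex + urban.rural versus income ~ sex.
rss_full = 1.2392e12    # RSS of Model 1, as printed (rounded)
sum_of_sq = 1900207595  # extra sum of squares explained by urban.rural
f_stat = nested_f_statistic(rss_full, 1606, rss_full + sum_of_sq, 1607)
print(round(f_stat, 3))
```

A small F statistic such as this one means that adding the variable reduces the residual sum of squares by little more than would be expected from noise.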

Adding occupation and determining significance

In this section, we look at the impact of adding occupation to determine whether it changes the effect size that sex has on income. First we look at the regression model:

Estimate Std. Error t value Pr(>|t|)
(Intercept) 49317.101 1970.242 25.031 0.0000
sexFemale -13659.769 1433.295 -9.530 0.0000
occupation.2017EXECUTIVE, ADMINISTRATIVE AND MANAGERIAL 20432.184 2429.861 8.409 0.0000
occupation.2017MANAGEMENT RELATED 23247.380 2923.719 7.951 0.0000
occupation.2017MATHEMATICAL AND COMPUTER SCIENTISTS 23238.032 3821.893 6.080 0.0000
occupation.2017ENGINEERS, ARCHITECTS, AND SURVEYORS 37199.717 6057.151 6.141 0.0000
occupation.2017PHYSICAL SCIENTISTS 4573.703 7753.942 0.590 0.5554
occupation.2017SOCIAL SCIENTISTS AND RELATED WORKERS 20222.637 8105.136 2.495 0.0127
occupation.2017COUNSELORS, SOCIAL, AND RELIGIOUS WORKERS 9164.441 4175.717 2.195 0.0283
occupation.2017LAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS 19935.065 7155.707 2.786 0.0054
occupation.2017TEACHERS 3410.348 2853.296 1.195 0.2322
occupation.2017EDUCATION, TRAINING, AND LIBRARY WORKERS -11099.937 6688.105 -1.660 0.0972
occupation.2017ENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS 13002.774 5392.314 2.411 0.0160
occupation.2017MEDIA AND COMMUNICATION WORKERS 7949.137 5852.422 1.358 0.1746
occupation.2017HEALTH DIAGNOSIS AND TREATING PRACTITIONERS 25317.998 3497.833 7.238 0.0000
occupation.2017HEALTH CARE TECHNICAL AND SUPPORT -6570.354 3472.982 -1.892 0.0587
occupation.2017PROTECTIVE SERVICE 11086.474 4118.175 2.692 0.0072
occupation.2017FOOD PREPARATIONS AND SERVING RELATED -14833.094 3701.140 -4.008 0.0001
occupation.2017CLEANING AND BUILDING SERVICE -11699.265 4968.747 -2.355 0.0187
occupation.2017PERSONAL CARE AND SERVICE WORKERS -13655.686 3806.137 -3.588 0.0003
occupation.2017SALES AND RELATED WORKERS 4664.295 2702.431 1.726 0.0845
occupation.2017CONSTRUCTION TRADES AND EXTRACTION WORKERS 545.931 3552.038 0.154 0.8779
occupation.2017INSTALLATION, MAINTENANCE, AND REPAIR WORKERS 1568.905 3920.260 0.400 0.6891
occupation.2017PRODUCTION AND OPERATING WORKERS -4602.159 5891.500 -0.781 0.4348
occupation.2017SETTER, OPERATORS, AND TENDERS -1396.706 4218.488 -0.331 0.7406
occupation.2017TRANSPORTATION AND MATERIAL MOVING WORKERS -7297.373 3331.757 -2.190 0.0287

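Through this table we see that the sexFemale coefficient remains statistically significant, with a p-value that rounds to 0. Relative to the baseline of Office and Administrative Support Workers, 14 of the 24 occupation coefficients are also significant at the 5% level (for example, Executive, Administrative and Managerial, and Food Preparations and Serving Related), while the remaining 10 (for example, Teachers, and Construction Trades and Extraction Workers) are not. A non-significant coefficient only means that the average income of that occupation cannot be distinguished from the baseline; it may also reflect the existence of collinearity.

We therefore run an ANOVA below to determine whether adding occupation as a whole significantly improves the model.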

## Analysis of Variance Table
## 
## Model 1: income ~ sex + occupation.2017
## Model 2: income ~ sex
##   Res.Df        RSS  Df   Sum of Sq      F    Pr(>F)    
## 1   1583 9.9547e+11                                     
## 2   1607 1.2411e+12 -24 -2.4566e+11 16.277 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because not every occupation level was individually significant, we tried removing the variable. However, the ANOVA test gives a highly significant result (F = 16.277, p-value < 2.2e-16), which tells us that the occupation.2017 variable is an important variable in modeling income.
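The F statistic in the ANOVA table can be recovered from the residual sums of squares. The following is a minimal sketch in Python (not the R code used for the report); the helper name `nested_f` is hypothetical, and the inputs are the rounded values printed in the table, so the result only approximately reproduces the reported 16.277.

```python
def nested_f(sum_sq_diff, df_diff, rss_full, df_resid_full):
    """F statistic for comparing two nested linear models.

    sum_sq_diff   : reduction in RSS gained by the larger model
    df_diff       : number of extra coefficients in the larger model
    rss_full      : residual sum of squares of the larger model
    df_resid_full : residual degrees of freedom of the larger model
    """
    numerator = sum_sq_diff / df_diff
    denominator = rss_full / df_resid_full
    return numerator / denominator


# Values copied from the ANOVA table above (rounded as printed by R).
f_occupation = nested_f(
    sum_sq_diff=2.4566e11,
    df_diff=24,
    rss_full=9.9547e11,
    df_resid_full=1583,
)
print(round(f_occupation, 3))
```

A large F value means that the 24 occupation coefficients together remove far more residual variation than would be expected by chance, even though some of them are not individually significant.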

For the next regression, we will be looking at the impact of these variables on income: gender, highest degree, occupation, total children, and industry.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 31150.119 3850.066 8.091 0.0000
sexFemale -12400.533 1364.599 -9.087 0.0000
highest.degree.attained.2017High School Diploma 6753.478 3016.304 2.239 0.0253
highest.degree.attained.2017GED 859.210 3508.766 0.245 0.8066
highest.degree.attained.2017Associate/Junior College 12912.009 3596.958 3.590 0.0003
highest.degree.attained.2017Bachelor’s Degree 22897.515 3199.716 7.156 0.0000
highest.degree.attained.2017Master’s Degree 31909.517 3752.446 8.504 0.0000
highest.degree.attained.2017Professional Degree (DDS, JD, MD) 49677.561 7593.481 6.542 0.0000
occupation.2017EXECUTIVE, ADMINISTRATIVE AND MANAGERIAL 15840.727 2325.775 6.811 0.0000
occupation.2017MANAGEMENT RELATED 14217.206 2822.122 5.038 0.0000
occupation.2017MATHEMATICAL AND COMPUTER SCIENTISTS 18240.352 3657.971 4.986 0.0000
occupation.2017ENGINEERS, ARCHITECTS, AND SURVEYORS 26028.188 5759.531 4.519 0.0000
occupation.2017PHYSICAL SCIENTISTS -11756.214 7315.070 -1.607 0.1082
occupation.2017SOCIAL SCIENTISTS AND RELATED WORKERS 12498.840 7576.065 1.650 0.0992
occupation.2017COUNSELORS, SOCIAL, AND RELIGIOUS WORKERS 565.146 4083.932 0.138 0.8900
occupation.2017LAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS 11658.653 6866.191 1.698 0.0897
occupation.2017TEACHERS -2421.154 3088.787 -0.784 0.4332
occupation.2017EDUCATION, TRAINING, AND LIBRARY WORKERS -12875.246 6337.332 -2.032 0.0424
occupation.2017ENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS 9454.633 5095.827 1.855 0.0637
occupation.2017MEDIA AND COMMUNICATION WORKERS 4977.600 5568.606 0.894 0.3715
occupation.2017HEALTH DIAGNOSIS AND TREATING PRACTITIONERS 20431.728 3572.076 5.720 0.0000
occupation.2017HEALTH CARE TECHNICAL AND SUPPORT -1188.352 3468.331 -0.343 0.7319
occupation.2017PROTECTIVE SERVICE 8840.900 4133.087 2.139 0.0326
occupation.2017FOOD PREPARATIONS AND SERVING RELATED -823.155 3989.492 -0.206 0.8366
occupation.2017CLEANING AND BUILDING SERVICE -1213.177 4774.095 -0.254 0.7994
occupation.2017PERSONAL CARE AND SERVICE WORKERS -9802.235 3694.125 -2.653 0.0080
occupation.2017SALES AND RELATED WORKERS 7607.294 2793.989 2.723 0.0065
occupation.2017CONSTRUCTION TRADES AND EXTRACTION WORKERS 3954.315 4264.581 0.927 0.3539
occupation.2017INSTALLATION, MAINTENANCE, AND REPAIR WORKERS 6977.528 3724.404 1.873 0.0612
occupation.2017PRODUCTION AND OPERATING WORKERS -3558.503 5684.112 -0.626 0.5314
occupation.2017SETTER, OPERATORS, AND TENDERS 1336.069 4216.560 0.317 0.7514
occupation.2017TRANSPORTATION AND MATERIAL MOVING WORKERS -2503.235 3272.196 -0.765 0.4444
children.total1 2974.023 1558.332 1.908 0.0565
children.total2 1565.810 1556.161 1.006 0.3145
children.total3 3022.314 1951.261 1.549 0.1216
children.total4 1230.200 3697.200 0.333 0.7394
children.total5 -9984.058 5950.261 -1.678 0.0936
industry.2017CONSTRUCTION 7034.963 3775.896 1.863 0.0626
industry.2017MANUFACTURING 6512.208 2983.576 2.183 0.0292
industry.2017WHOLESALE TRADE 10525.835 4061.888 2.591 0.0096
industry.2017RETAIL TRADE -4246.831 3087.397 -1.376 0.1692
industry.2017TRANSPORTATION AND WAREHOUSING 10015.124 3796.602 2.638 0.0084
industry.2017INFORMATION AND COMMUNICATION 6221.390 5316.231 1.170 0.2421
industry.2017FINANCE, INSURANCE, AND REAL ESTATE 6825.431 2783.743 2.452 0.0143
industry.2017PROFESSIONAL AND RELATED SERVICES 1595.929 2541.819 0.628 0.5302
industry.2017ACS SPECIAL CODES 14390.481 3108.566 4.629 0.0000
industry.2017ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES -6381.503 3038.542 -2.100 0.0359
industry.2017OTHER SERVICES 531.425 3553.856 0.150 0.8812
industry.2017PUBLIC ADMINISTRATION 9080.857 3060.031 2.968 0.0030

In this model the baseline intercept refers to a male (the gender coefficient is labeled sexFemale, so male is the reference level) with a highest degree of None (which means less than a high school diploma), an occupation in Office and Administrative Support Workers, 0 total children, and a job in the Educational, Health, and Social Services industry. This individual has a predicted income of $31150.12. As all the variables in this model are categorical, one simply adds the coefficients of the relevant levels to the intercept to obtain the prediction for a different individual. For example, to change this person’s gender to female and have them work in the Construction Trades and Extraction occupation, one would first decrease their income by $12400.53 (for being female) and then add $3954.32 (for the construction occupation). Their new predicted income would be $22703.90.
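Because every predictor is categorical, a prediction is just the intercept plus the coefficients of the non-baseline levels that apply. The sketch below is written in Python for illustration; the dictionary keys are shortened, hypothetical labels, and only a handful of coefficients are copied from the table above. Note that the gender coefficient is labeled sexFemale and is negative, so it is subtracted from a male reference individual.

```python
# Intercept and a small subset of coefficients copied from the table above.
INTERCEPT = 31150.119
COEFFICIENTS = {
    "sexFemale": -12400.533,
    "degree: Bachelor's Degree": 22897.515,
    "occupation: CONSTRUCTION TRADES AND EXTRACTION WORKERS": 3954.315,
    "industry: CONSTRUCTION": 7034.963,
}


def predict_income(levels):
    """Add the coefficient of every listed non-baseline level to the intercept."""
    return INTERCEPT + sum(COEFFICIENTS[level] for level in levels)


baseline = predict_income([])
female_construction_occ = predict_income(
    ["sexFemale", "occupation: CONSTRUCTION TRADES AND EXTRACTION WORKERS"]
)
female_bachelors = predict_income(["sexFemale", "degree: Bachelor's Degree"])

print(f"{baseline:.2f} {female_construction_occ:.2f} {female_bachelors:.2f}")
```

Passing an empty list returns the intercept, that is, the prediction for an individual at the reference level of every variable.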

In this model, the following levels of the variables are significant at the 5% level:

  • Gender: Female (relative to the male baseline)
  • Highest Degree: High School Diploma; Associate degree and up
  • Occupation: Executive, Administrative and Managerial; Management Related; Mathematical and Computer Scientists; Engineers, Architects, and Surveyors; Education, Training, and Library Workers; Health Diagnosis and Treating Practitioners; Protective Service; Personal Care and Service Workers; Sales and Related Workers
  • Industry: Manufacturing; Wholesale Trade; Transportation and Warehousing; Finance, Insurance, and Real Estate; ACS Special Codes; Entertainment, Accommodations, and Food Services; Public Administration

As none of the total children levels was statistically significant at the 5% level (the closest, one child, has a p-value of 0.0565), let’s try removing the total children variable from the model and conduct an ANOVA test to see if there is a difference between these two models: one with it and one without it.

## Analysis of Variance Table
## 
## Model 1: income ~ sex + highest.degree.attained.2017 + occupation.2017 + 
##     industry.2017
## Model 2: income ~ sex + highest.degree.attained.2017 + occupation.2017 + 
##     children.total + industry.2017
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1   1565 8.3645e+11                           
## 2   1560 8.3195e+11  5   4.5e+09 1.6876 0.1344

As the ANOVA test gives a p-value of 0.1344, which is greater than 0.05, we conclude that the total children variable does not significantly improve the model and that the model without the total children variable is adequate.

Difference in Means for Industry

To look at the statistical significance of the income gap between men and women, we create a difference in means plot by industry.

The greatest gender differences in mean income are found in three industries: public administration; finance, insurance, and real estate; and ACS special codes. Furthermore, the gender gaps do not differ significantly from one industry to another, i.e., they may all be around the same size.

Adding Highest Degree with Gender as an Interaction Term

Estimate Std. Error t value Pr(>|t|)
(Intercept) 29122.500 4322.830 6.737 0.0000
sexFemale -7546.351 5989.590 -1.260 0.2079
highest.degree.attained.2017High School Diploma 8061.460 3697.840 2.180 0.0294
highest.degree.attained.2017GED 1539.498 4327.055 0.356 0.7221
highest.degree.attained.2017Associate/Junior College 14977.253 4623.190 3.240 0.0012
highest.degree.attained.2017Bachelor’s Degree 24487.805 3938.623 6.217 0.0000
highest.degree.attained.2017Master’s Degree 36036.210 4927.199 7.314 0.0000
highest.degree.attained.2017Professional Degree (DDS, JD, MD) 66732.303 11158.016 5.981 0.0000
occupation.2017EXECUTIVE, ADMINISTRATIVE AND MANAGERIAL 15725.311 2332.803 6.741 0.0000
occupation.2017MANAGEMENT RELATED 14007.383 2838.636 4.935 0.0000
occupation.2017MATHEMATICAL AND COMPUTER SCIENTISTS 17937.501 3668.245 4.890 0.0000
occupation.2017ENGINEERS, ARCHITECTS, AND SURVEYORS 25140.670 5812.275 4.325 0.0000
occupation.2017PHYSICAL SCIENTISTS -12033.529 7321.272 -1.644 0.1005
occupation.2017SOCIAL SCIENTISTS AND RELATED WORKERS 12720.662 7587.410 1.677 0.0938
occupation.2017COUNSELORS, SOCIAL, AND RELIGIOUS WORKERS 734.936 4099.695 0.179 0.8578
occupation.2017LAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS 12309.761 6874.780 1.791 0.0736
occupation.2017TEACHERS -2187.736 3106.671 -0.704 0.4814
occupation.2017EDUCATION, TRAINING, AND LIBRARY WORKERS -12654.468 6352.197 -1.992 0.0465
occupation.2017ENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS 9350.604 5103.843 1.832 0.0671
occupation.2017MEDIA AND COMMUNICATION WORKERS 5090.576 5595.179 0.910 0.3631
occupation.2017HEALTH DIAGNOSIS AND TREATING PRACTITIONERS 20607.523 3609.104 5.710 0.0000
occupation.2017HEALTH CARE TECHNICAL AND SUPPORT -1084.312 3474.870 -0.312 0.7550
occupation.2017PROTECTIVE SERVICE 9253.575 4169.016 2.220 0.0266
occupation.2017FOOD PREPARATIONS AND SERVING RELATED -875.810 3997.626 -0.219 0.8266
occupation.2017CLEANING AND BUILDING SERVICE -1264.085 4788.769 -0.264 0.7918
occupation.2017PERSONAL CARE AND SERVICE WORKERS -9889.507 3702.539 -2.671 0.0076
occupation.2017SALES AND RELATED WORKERS 7526.588 2800.234 2.688 0.0073
occupation.2017CONSTRUCTION TRADES AND EXTRACTION WORKERS 4278.379 4309.142 0.993 0.3209
occupation.2017INSTALLATION, MAINTENANCE, AND REPAIR WORKERS 7244.627 3767.122 1.923 0.0546
occupation.2017PRODUCTION AND OPERATING WORKERS -3631.021 5706.412 -0.636 0.5247
occupation.2017SETTER, OPERATORS, AND TENDERS 1246.310 4231.079 0.295 0.7684
occupation.2017TRANSPORTATION AND MATERIAL MOVING WORKERS -2130.291 3314.378 -0.643 0.5205
children.total1 3043.007 1559.458 1.951 0.0512
children.total2 1559.399 1558.995 1.000 0.3173
children.total3 3015.890 1959.927 1.539 0.1241
children.total4 924.775 3715.019 0.249 0.8034
children.total5 -10521.018 5964.271 -1.764 0.0779
industry.2017CONSTRUCTION 7647.922 3790.084 2.018 0.0438
industry.2017MANUFACTURING 7126.171 3003.763 2.372 0.0178
industry.2017WHOLESALE TRADE 10995.679 4069.815 2.702 0.0070
industry.2017RETAIL TRADE -3895.169 3094.730 -1.259 0.2083
industry.2017TRANSPORTATION AND WAREHOUSING 10523.842 3805.112 2.766 0.0057
industry.2017INFORMATION AND COMMUNICATION 6811.518 5344.396 1.275 0.2027
industry.2017FINANCE, INSURANCE, AND REAL ESTATE 7299.267 2792.241 2.614 0.0090
industry.2017PROFESSIONAL AND RELATED SERVICES 2056.298 2553.970 0.805 0.4209
industry.2017ACS SPECIAL CODES 14347.390 3112.140 4.610 0.0000
industry.2017ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES -6243.466 3054.412 -2.044 0.0411
industry.2017OTHER SERVICES 975.589 3563.656 0.274 0.7843
industry.2017PUBLIC ADMINISTRATION 9474.393 3069.222 3.087 0.0021
sexFemale:highest.degree.attained.2017High School Diploma -4189.375 6163.425 -0.680 0.4968
sexFemale:highest.degree.attained.2017GED -2675.755 7204.382 -0.371 0.7104
sexFemale:highest.degree.attained.2017Associate/Junior College -5546.329 7238.672 -0.766 0.4437
sexFemale:highest.degree.attained.2017Bachelor’s Degree -4529.489 6333.575 -0.715 0.4746
sexFemale:highest.degree.attained.2017Master’s Degree -8776.521 7361.676 -1.192 0.2334
sexFemale:highest.degree.attained.2017Professional Degree (DDS, JD, MD) -31525.150 14904.481 -2.115 0.0346

As several of the levels of the education variable were statistically significant, as was the gender variable, we decided to run an interaction model. It seems that only one of the interaction terms (those that contain a colon, near the bottom of the table) is statistically significant at the 5% level: the term for females with a professional degree, with a p-value of 0.0346. To test whether the interaction terms matter as a group, we can conduct an ANOVA test.
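In an interaction model, the estimated gender gap at a given degree level is the sexFemale coefficient plus the interaction coefficient for that level. The sketch below (Python, for illustration only) adds up the point estimates copied from the table above; it says nothing about their uncertainty, and most of the interaction terms are not individually significant.

```python
# Main effect of being female at the baseline degree level ("None").
SEX_FEMALE = -7546.351

# Interaction coefficients (sexFemale:highest.degree.attained.2017...) from the table.
INTERACTIONS = {
    "High School Diploma": -4189.375,
    "GED": -2675.755,
    "Associate/Junior College": -5546.329,
    "Bachelor's Degree": -4529.489,
    "Master's Degree": -8776.521,
    "Professional Degree (DDS, JD, MD)": -31525.150,
}

# Estimated female-minus-male income difference at each degree level.
gap_by_degree = {"None": SEX_FEMALE}
for degree, extra in INTERACTIONS.items():
    gap_by_degree[degree] = SEX_FEMALE + extra

for degree, gap in gap_by_degree.items():
    print(f"{degree}: {gap:.2f}")
```

The professional degree level stands out, but its interaction term also has by far the largest standard error in the table, so the estimate is imprecise.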

Therefore, we run an ANOVA below to determine whether the interaction between highest degree and sex significantly improves the model.

## Analysis of Variance Table
## 
## Model 1: income ~ sex + highest.degree.attained.2017 + occupation.2017 + 
##     children.total + industry.2017 + sex:highest.degree.attained.2017
## Model 2: income ~ sex + highest.degree.attained.2017 + occupation.2017 + 
##     children.total + industry.2017
##   Res.Df        RSS Df   Sum of Sq      F Pr(>F)
## 1   1554 8.2894e+11                             
## 2   1560 8.3195e+11 -6 -3006162269 0.9393 0.4656

As the ANOVA test provided a p-value of 0.4656, this tells us that the new interaction model is not better at predicting income than the old model. We know this because the p-value is greater than 0.05. As such, it is better to simply use the older model.

Adding Spouse Income

Estimate Std. Error t value Pr(>|t|)
(Intercept) 27224.225 3725.551 7.307 0.0000
sexFemale -13708.517 1339.199 -10.236 0.0000
highest.degree.attained.2017High School Diploma 5531.115 2952.943 1.873 0.0612
highest.degree.attained.2017GED 533.256 3443.581 0.155 0.8770
highest.degree.attained.2017Associate/Junior College 11006.864 3527.716 3.120 0.0018
highest.degree.attained.2017Bachelor’s Degree 19540.780 3142.300 6.219 0.0000
highest.degree.attained.2017Master’s Degree 28452.589 3679.541 7.733 0.0000
highest.degree.attained.2017Professional Degree (DDS, JD, MD) 44981.343 7487.233 6.008 0.0000
occupation.2017EXECUTIVE, ADMINISTRATIVE AND MANAGERIAL 15457.104 2287.817 6.756 0.0000
occupation.2017MANAGEMENT RELATED 12888.630 2782.063 4.633 0.0000
occupation.2017MATHEMATICAL AND COMPUTER SCIENTISTS 19438.892 3603.611 5.394 0.0000
occupation.2017ENGINEERS, ARCHITECTS, AND SURVEYORS 25293.579 5656.872 4.471 0.0000
occupation.2017PHYSICAL SCIENTISTS -12155.443 7197.949 -1.689 0.0915
occupation.2017SOCIAL SCIENTISTS AND RELATED WORKERS 9277.854 7444.613 1.246 0.2129
occupation.2017COUNSELORS, SOCIAL, AND RELIGIOUS WORKERS 1364.657 4014.649 0.340 0.7340
occupation.2017LAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS 11160.126 6749.957 1.653 0.0985
occupation.2017TEACHERS -1521.828 3025.818 -0.503 0.6151
occupation.2017EDUCATION, TRAINING, AND LIBRARY WORKERS -12408.687 6217.108 -1.996 0.0461
occupation.2017ENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS 8123.612 5016.040 1.620 0.1055
occupation.2017MEDIA AND COMMUNICATION WORKERS 3721.082 5477.310 0.679 0.4970
occupation.2017HEALTH DIAGNOSIS AND TREATING PRACTITIONERS 20287.964 3514.028 5.773 0.0000
occupation.2017HEALTH CARE TECHNICAL AND SUPPORT -325.224 3415.503 -0.095 0.9242
occupation.2017PROTECTIVE SERVICE 9475.883 4061.561 2.333 0.0198
occupation.2017FOOD PREPARATIONS AND SERVING RELATED 362.625 3919.659 0.093 0.9263
occupation.2017CLEANING AND BUILDING SERVICE -394.086 4697.288 -0.084 0.9331
occupation.2017PERSONAL CARE AND SERVICE WORKERS -10455.879 3631.951 -2.879 0.0040
occupation.2017SALES AND RELATED WORKERS 7206.923 2748.464 2.622 0.0088
occupation.2017CONSTRUCTION TRADES AND EXTRACTION WORKERS 4428.304 4191.003 1.057 0.2908
occupation.2017INSTALLATION, MAINTENANCE, AND REPAIR WORKERS 7099.678 3663.910 1.938 0.0528
occupation.2017PRODUCTION AND OPERATING WORKERS -2739.127 5589.974 -0.490 0.6242
occupation.2017SETTER, OPERATORS, AND TENDERS 669.241 4149.261 0.161 0.8719
occupation.2017TRANSPORTATION AND MATERIAL MOVING WORKERS -2430.390 3218.669 -0.755 0.4503
spouse.income 0.156 0.021 7.365 0.0000
industry.2017CONSTRUCTION 6778.906 3709.183 1.828 0.0678
industry.2017MANUFACTURING 7993.682 2934.881 2.724 0.0065
industry.2017WHOLESALE TRADE 12041.343 3992.170 3.016 0.0026
industry.2017RETAIL TRADE -3635.109 3029.722 -1.200 0.2304
industry.2017TRANSPORTATION AND WAREHOUSING 10891.397 3733.747 2.917 0.0036
industry.2017INFORMATION AND COMMUNICATION 7991.496 5231.182 1.528 0.1268
industry.2017FINANCE, INSURANCE, AND REAL ESTATE 7858.467 2737.953 2.870 0.0042
industry.2017PROFESSIONAL AND RELATED SERVICES 2124.860 2490.778 0.853 0.3937
industry.2017ACS SPECIAL CODES 15379.004 3057.980 5.029 0.0000
industry.2017ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES -5880.957 2982.318 -1.972 0.0488
industry.2017OTHER SERVICES 1683.793 3495.890 0.482 0.6301
industry.2017PUBLIC ADMINISTRATION 9969.962 3008.193 3.314 0.0009

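For our last regression, we decided to add spouse income to the previous model (the one without total children). The spouse.income coefficient of 0.156 means that each additional dollar of spouse income is associated with about $0.16 of additional income for the respondent, and its p-value is significant at the 5% level.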

## Analysis of Variance Table
## 
## Model 1: income ~ sex + highest.degree.attained.2017 + occupation.2017 + 
##     industry.2017
## Model 2: income ~ sex + highest.degree.attained.2017 + occupation.2017 + 
##     spouse.income + industry.2017
##   Res.Df        RSS Df  Sum of Sq     F    Pr(>F)    
## 1   1565 8.3645e+11                                  
## 2   1564 8.0841e+11  1 2.8041e+10 54.25 2.844e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To be certain of our results, we conduct an ANOVA test. Its p-value of 2.844e-13 is also significant at the 5% level.

We produced the diagnostic plots below to compare the fit of our models.

Comparing Goodness of Fit Between Models via Diagnostic Plots

Now that we have done multiple iterations of our regression, and have determined what variables to add, it would be useful to compare the fit of our data between our first and final regression.

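The first model we are comparing is the regression of income on sex:
\(\text{Income} = \beta_0 + \beta_1 \cdot \text{sex}\)


The second model that we are comparing is the regression of income on sex, highest degree attained, occupation, spouse income, and industry (each categorical variable contributes one coefficient per non-baseline level):
\(\text{Income} = \beta_0 + \beta_1 \cdot \text{sex} + \beta_2 \cdot \text{highest degree attained} + \beta_3 \cdot \text{occupation} + \beta_4 \cdot \text{spouse income} + \beta_5 \cdot \text{industry}\)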


Diagnostics for Model 1

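Residuals vs. Fitted The residuals versus fitted plot indicates that the residuals do not have constant variance. The points fall on two distinct lines because sex, the only predictor, is binary, so the model can produce only two fitted values. This pattern suggests that a linear model with sex alone is not appropriate for this data.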
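The two-line pattern can be reproduced with a tiny example. The sketch below fits a least-squares line by hand in Python; the six data points are hypothetical toy values, not taken from the NLSY97 sample. With a single 0/1 predictor, the fitted values are just the two group means, so every point in a residuals versus fitted plot sits at one of two x-positions.

```python
# Hypothetical toy data, NOT taken from the NLSY97 sample.
sex = [0, 0, 0, 1, 1, 1]  # binary predictor coded 0/1
income = [30.0, 50.0, 70.0, 20.0, 40.0, 45.0]  # response, arbitrary units

n = len(sex)
mean_x = sum(sex) / n
mean_y = sum(income) / n

# Ordinary least squares slope and intercept for one predictor.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sex, income)) / sum(
    (x - mean_x) ** 2 for x in sex
)
intercept = mean_y - slope * mean_x

fitted = [intercept + slope * x for x in sex]

# Only two distinct fitted values exist: one per group (the group means).
distinct_fitted = sorted(set(round(f, 6) for f in fitted))
print(distinct_fitted)
```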

Normal QQ plot This plot compares the quantiles of the standardized residuals with those of a normal distribution. It shows that the residuals are not normally distributed: especially in the upper tail, their quantiles do not follow those of the normal distribution. This indicates that our p-values should be treated with caution.

Scale-location plot The scale-location plot is similar to the residuals versus fitted plot. Although the trend line is straight in both, the points again fall into two distinct lines. Ideally, the trend line would be horizontal, which would indicate constant variance.

Residuals vs Leverage
The residuals versus leverage plot for Model 1 does not even show a Cook’s distance line. Therefore, there are no observations that have both a large residual and high leverage. This suggests that there are no influential outliers in this data set.

Diagnostics for Model 2

Residuals vs. Fitted The residuals in this plot have a less defined pattern compared to the first model. They also have a more constant variance, which indicates that a linear model is more appropriate for Model 2 than for Model 1.

Normal QQ plot As in the Normal QQ plot for Model 1, the tails of the residual distribution are not perfectly aligned with those of the normal distribution. The lower tail of Model 2’s Normal QQ plot is more closely aligned with the normal distribution than in Model 1. However, since the upper tail is still not closely aligned, we should still treat the p-values with caution.

Scale-location plot The scale-location plot for Model 2 does not match the ideal of a horizontal line that shows constant variance for residuals. Compared to Model 1, it has a less discernible pattern, but its trend line is still not horizontal.

Residuals vs Leverage
Unlike in Model 1, the Cook’s distance line shows up in Model 2. This means that there are more observations with higher leverage and larger residuals than in Model 1. However, since no observation has both high leverage and a large residual according to the Cook’s distance line, we do not have any influential outliers in Model 2.

Diagnostics for Model 3

The diagnostic plots for model 3 are identical to the plots of model 2, so please see the prior interpretations.

Discussion

Main Conclusions

Through our analysis, we sought to answer the following question:

Is there a significant difference in income between men and women? Does the difference vary depending on other factors?

In response to the first question, we found that the answer was yes. Men, on average, earn $14354.72 more than women. More precisely, the average man in the US makes $56108.74 and the average woman makes $41754.02. This means that the average woman makes 74 cents for every dollar a man makes. This linear regression also had a p-value that rounds to 0, which shows that the difference is significant at the 5% level.
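The headline figures follow directly from the two group means. The short Python sketch below (for illustration only; the variable names are our own) reproduces the dollar gap, the cents-on-the-dollar ratio, and the percentage gap from the means reported above.

```python
# Mean incomes reported above.
mean_income_men = 56108.74
mean_income_women = 41754.02

gap_dollars = mean_income_men - mean_income_women  # raw gap in dollars
ratio = mean_income_women / mean_income_men        # women's income per dollar earned by men
gap_percent = 100 * (1 - ratio)                    # gap as a percentage of men's income

print(f"{gap_dollars:.2f} {ratio:.2f} {gap_percent:.0f}%")
```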

As for the second question, the difference does appear to vary depending on other factors, though not very much. Even when accounting for highest degree attained, occupation, industry, and spouse income, we were only able to reduce the income gap by $646.20, which translates to a reduction of 3 percentage points, from a 26% income gap to a 23% income gap. This still leaves a 23% gap unaccounted for by our current model.

On the one hand, this could provide quantitative evidence for gender discrimination against women. On the other hand, this could simply mean there are other confounding variables we have yet to account for. Some examples of these variables include:

  • Work hours: Women may tend to prefer more flexible work hours than men
  • Salary negotiation: Men may be more likely than women to negotiate their salaries upward
  • Maternity leave: Maternity leave is more widely used than paternity leave, and there is no federal law mandating paid maternity leave

As such, we conclude that, because 23% of the wage gap is still unexplained by our model, we are not confident that we have accounted for all possible confounding variables, and so we cannot be certain that this remaining gap is caused by discrimination.

Limitations and Confidence

The limitations of our analysis were shown via our diagnostic plots. For our model of income on gender alone, while the gender coefficient was statistically and practically significant, the diagnostic plots showed that the model lacked constant variance, thereby reducing our confidence in the p-value. Our model of gender with highest degree, occupation, spouse income, and industry was an improvement on the earlier model in terms of constant variance, but there is still room for improvement, as the scale-location plot was not as horizontal as it ideally would be.

Overall, however, our initial model is believable, as our finding of 26% for the overall wage gap is near estimates from other researchers. Our second model is less believable, as other researchers have been able to include other variables and explain a greater part of the wage gap than we have. As such, we would not feel confident presenting our analysis to policymakers. If we were to present to policymakers now, without accounting for confounding variables, our main policy recommendation would likely be quite different from the one we would make had we accounted for all possible confounding variables.

On a final note, it is clear that more research must be done and more legislation is needed. We conservatively estimate the lifetime earnings a woman loses because of the wage gap to be around $682950. This makes our findings quite practically significant.